This repository has been archived by the owner on Jun 26, 2020. It is now read-only.

Many multi-value returns #1147

Merged
merged 30 commits on Nov 5, 2019

Conversation

fitzgen
Member

@fitzgen fitzgen commented Oct 15, 2019

Do not merge! This is a work-in-progress! EDIT: ready for review now :)

I'm working on supporting arbitrarily many return values (although we will need to enforce some reasonable upper bound) in Cranelift functions, and therefore in Wasm as well.

The approach I've been trying is to:

  • Make use of the sret special argument for passing in a return pointer, pointing to space allocated in the caller's stack frame for the return values.

  • Introduce new ArgumentLocation and ValueLocation variants for essentially "offset x from this function's sret parameter": *(sret + x).

  • During legalization, we rewrite signatures, calls, and returns to introduce usage of the sret return pointer parameter. Specifically,

    • legalize_signature will add the sret parameter and assign offsets for each return value (currently implemented),

    • handle_call_abi will rewrite multi-value calls to allocate a stack slot for the return values, pass a pointer to that space as the sret argument, and then load the return values out of the stack slot (not yet implemented),

    • and handle_return_abi will rewrite multi-value returns to store the values into the sret return pointer at the proper offsets, and return the sret return pointer in a register.
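
To make the offset-assignment step concrete, here's a minimal sketch of the bookkeeping that legalize_signature needs to do for the return area — a hypothetical helper, not the actual cranelift-codegen code; it just hands out sequential byte offsets into the caller-allocated sret space:

```rust
// Hypothetical sketch (not the real cranelift-codegen code): assign each
// return value an offset into the caller-allocated sret area, in order.
// Each value i ends up addressed as *(sret + offsets[i]).
fn assign_sret_offsets(ret_sizes_in_bytes: &[u32]) -> Vec<u32> {
    let mut offset = 0;
    ret_sizes_in_bytes
        .iter()
        .map(|&size| {
            let this = offset; // this value's slot starts here
            offset += size;
            this
        })
        .collect()
}

fn main() {
    // Four i64 returns, 8 bytes each, matching the sret(0x0)..sret(0x18)
    // annotations in the legalized signatures shown in this thread.
    assert_eq!(assign_sret_offsets(&[8, 8, 8, 8]), vec![0, 8, 16, 24]);
}
```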

This almost works! However, I'm running into a little trouble during register allocation.

Consider this function that returns four i64s:

function %return_4_i64s() -> i64, i64, i64, i64 fast {
ebb0:
    v0 = iconst.i64 0
    v1 = iconst.i64 1
    v2 = iconst.i64 2
    v3 = iconst.i64 3
    return v0, v1, v2, v3
}

After legalization, we see that an sret parameter has been added, that all the returns are assigned to offsets from the sret, and that we are storing the return values at those offsets just before returning. Great!

function %return_4_i64s(i64 sret [%rdi]) -> i64 [sret(0x0)], i64 [sret(0x8)], i64 [sret(0x10)], i64 [sret(0x18)], i64 sret [%rax] fast {
                                ebb0(v4: i64):
[RexOp1pu_id#b8,sret(0)]            v0 = iconst.i64 0
[RexOp1pu_id#b8,sret(8)]            v1 = iconst.i64 1
[RexOp1pu_id#b8,sret(16)]           v2 = iconst.i64 2
[RexOp1pu_id#b8,sret(24)]           v3 = iconst.i64 3
[RexOp1st#8089]                     store notrap aligned v0, v4
[RexOp1stDisp8#8089]                store notrap aligned v1, v4+8
[RexOp1stDisp8#8089]                store notrap aligned v2, v4+16
[RexOp1stDisp8#8089]                store notrap aligned v3, v4+24
[Op1ret#c3]                         return v0, v1, v2, v3, v4
}

However, when register allocation comes along, it unnecessarily inserts an instruction to move v0 from %rax to %rsi. I'm a bit confused about why this is happening: I update the ValueLocation to an offset from the sret, so the allocator shouldn't even think that v0 is live in %rax! Anyway, even with this unnecessary instruction included, things would otherwise work, except that the register allocator also records the new location of v0 as %rsi, which means that the verifier now complains that the arguments are in the wrong places!

function %return_4_i64s(i64 sret [%rdi]) -> i64 [sret(0x0)], i64 [sret(0x8)], i64 [sret(0x10)], i64 [sret(0x18)], i64 sret [%rax] fast {
                                ebb0(v4: i64 [%rdi]):
[RexOp1pu_id#b8,%rax]               v0 = iconst.i64 0
[RexOp1pu_id#b8,%rcx]               v1 = iconst.i64 1
[RexOp1pu_id#b8,%rdx]               v2 = iconst.i64 2
[RexOp1pu_id#b8,%rbx]               v3 = iconst.i64 3
[RexOp1st#8089]                     store notrap aligned v0, v4
[RexOp1stDisp8#8089]                store notrap aligned v1, v4+8
[RexOp1stDisp8#8089]                store notrap aligned v2, v4+16
[RexOp1stDisp8#8089]                store notrap aligned v3, v4+24
[RexOp1rmov#8089]                   regmove v0, %rax -> %rsi
[RexOp1rmov#8089]                   regmove v4, %rdi -> %rax
[Op1ret#c3]                         return v0, v1, v2, v3, v4
;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
; error: inst4: ABI expects v0 at return pointer offset 0, got %rsi

}

Any idea what might be going on and what I should do from here?

Full debug logs
     Running `target/debug/clif-util compile -dtp /home/fitzgen/cranelift/filetests/wasm/multi-val-return-i64.clif --target x86_64`
 DEBUG cranelift_codegen::timing::details > timing: Starting Parsing textual Cranelift IR, (during <no pass>)
 DEBUG cranelift_codegen::timing::details > timing: Ending Parsing textual Cranelift IR
 DEBUG cranelift_codegen::timing::details > timing: Starting Compilation passes, (during <no pass>)
 DEBUG cranelift_codegen::timing::details > timing: Starting Verify Cranelift IR, (during Compilation passes)
 DEBUG cranelift_codegen::timing::details > timing: Starting Control flow graph, (during Verify Cranelift IR)
 DEBUG cranelift_codegen::timing::details > timing: Ending Control flow graph
 DEBUG cranelift_codegen::timing::details > timing: Starting Dominator tree, (during Verify Cranelift IR)
 DEBUG cranelift_codegen::timing::details > timing: Ending Dominator tree
 DEBUG cranelift_codegen::timing::details > timing: Starting Verify CPU flags, (during Verify Cranelift IR)
 DEBUG cranelift_codegen::timing::details > timing: Ending Verify CPU flags
 DEBUG cranelift_codegen::timing::details > timing: Ending Verify Cranelift IR
 DEBUG cranelift_codegen::context         > Compiling:
function %return_4_i64s() -> i64, i64, i64, i64 fast {
ebb0:
    v0 = iconst.i64 0
    v1 = iconst.i64 1
    v2 = iconst.i64 2
    v3 = iconst.i64 3
    return v0, v1, v2, v3
}

 DEBUG cranelift_codegen::timing::details > timing: Starting Control flow graph, (during Compilation passes)
 DEBUG cranelift_codegen::timing::details > timing: Ending Control flow graph
 DEBUG cranelift_codegen::timing::details > timing: Starting Legalization, (during Compilation passes)
 DEBUG cranelift_codegen::legalizer::boundary > Adding 1 special-purpose arguments to return v0, v1, v2, v3
 DEBUG cranelift_codegen::timing::details     > timing: Ending Legalization
 DEBUG cranelift_codegen::context             > Legalized:
function %return_4_i64s(i64 sret [%rdi]) -> i64 [sret(0x0)], i64 [sret(0x8)], i64 [sret(0x10)], i64 [sret(0x18)], i64 sret [%rax] fast {
                                ebb0(v4: i64):
[RexOp1pu_id#b8,sret(0)]            v0 = iconst.i64 0
[RexOp1pu_id#b8,sret(8)]            v1 = iconst.i64 1
[RexOp1pu_id#b8,sret(16)]           v2 = iconst.i64 2
[RexOp1pu_id#b8,sret(24)]           v3 = iconst.i64 3
[RexOp1st#8089]                     store notrap aligned v0, v4
[RexOp1stDisp8#8089]                store notrap aligned v1, v4+8
[RexOp1stDisp8#8089]                store notrap aligned v2, v4+16
[RexOp1stDisp8#8089]                store notrap aligned v3, v4+24
[Op1ret#c3]                         return v0, v1, v2, v3, v4
}

 DEBUG cranelift_codegen::timing::details     > timing: Starting Verify Cranelift IR, (during Compilation passes)
 DEBUG cranelift_codegen::timing::details     > timing: Starting Control flow graph, (during Verify Cranelift IR)
 DEBUG cranelift_codegen::timing::details     > timing: Ending Control flow graph
 DEBUG cranelift_codegen::timing::details     > timing: Starting Dominator tree, (during Verify Cranelift IR)
 DEBUG cranelift_codegen::timing::details     > timing: Ending Dominator tree
 DEBUG cranelift_codegen::timing::details     > timing: Starting Verify CPU flags, (during Verify Cranelift IR)
 DEBUG cranelift_codegen::timing::details     > timing: Ending Verify CPU flags
 DEBUG cranelift_codegen::timing::details     > timing: Ending Verify Cranelift IR
 DEBUG cranelift_codegen::timing::details     > timing: Starting Dominator tree, (during Compilation passes)
 DEBUG cranelift_codegen::timing::details     > timing: Ending Dominator tree
 DEBUG cranelift_codegen::timing::details     > timing: Starting Remove unreachable blocks, (during Compilation passes)
 DEBUG cranelift_codegen::timing::details     > timing: Ending Remove unreachable blocks
 DEBUG cranelift_codegen::timing::details     > timing: Starting Verify Cranelift IR, (during Compilation passes)
 DEBUG cranelift_codegen::timing::details     > timing: Starting Control flow graph, (during Verify Cranelift IR)
 DEBUG cranelift_codegen::timing::details     > timing: Ending Control flow graph
 DEBUG cranelift_codegen::timing::details     > timing: Starting Dominator tree, (during Verify Cranelift IR)
 DEBUG cranelift_codegen::timing::details     > timing: Ending Dominator tree
 DEBUG cranelift_codegen::timing::details     > timing: Starting Verify CPU flags, (during Verify Cranelift IR)
 DEBUG cranelift_codegen::timing::details     > timing: Ending Verify CPU flags
 DEBUG cranelift_codegen::timing::details     > timing: Ending Verify Cranelift IR
 DEBUG cranelift_codegen::timing::details     > timing: Starting Register allocation, (during Compilation passes)
 DEBUG cranelift_codegen::timing::details     > timing: Starting RA liveness analysis, (during Register allocation)
 DEBUG cranelift_codegen::timing::details     > timing: Ending RA liveness analysis
 DEBUG cranelift_codegen::timing::details     > timing: Starting Verify live ranges, (during Register allocation)
 DEBUG cranelift_codegen::timing::details     > timing: Ending Verify live ranges
 DEBUG cranelift_codegen::timing::details     > timing: Starting RA coalescing CSSA, (during Register allocation)
 DEBUG cranelift_codegen::regalloc::coalescing > Coalescing for:
function %return_4_i64s(i64 sret [%rdi]) -> i64 [sret(0x0)], i64 [sret(0x8)], i64 [sret(0x10)], i64 [sret(0x18)], i64 sret [%rax] fast {
                                ebb0(v4: i64):
[RexOp1pu_id#b8,sret(0)]            v0 = iconst.i64 0
[RexOp1pu_id#b8,sret(8)]            v1 = iconst.i64 1
[RexOp1pu_id#b8,sret(16)]           v2 = iconst.i64 2
[RexOp1pu_id#b8,sret(24)]           v3 = iconst.i64 3
[RexOp1st#8089]                     store notrap aligned v0, v4
[RexOp1stDisp8#8089]                store notrap aligned v1, v4+8
[RexOp1stDisp8#8089]                store notrap aligned v2, v4+16
[RexOp1stDisp8#8089]                store notrap aligned v3, v4+24
[Op1ret#c3]                         return v0, v1, v2, v3, v4
}

 DEBUG cranelift_codegen::regalloc::coalescing > After union-find phase:
 DEBUG cranelift_codegen::timing::details      > timing: Ending RA coalescing CSSA
 DEBUG cranelift_codegen::timing::details      > timing: Starting Verify Cranelift IR, (during Register allocation)
 DEBUG cranelift_codegen::timing::details      > timing: Starting Control flow graph, (during Verify Cranelift IR)
 DEBUG cranelift_codegen::timing::details      > timing: Ending Control flow graph
 DEBUG cranelift_codegen::timing::details      > timing: Starting Dominator tree, (during Verify Cranelift IR)
 DEBUG cranelift_codegen::timing::details      > timing: Ending Dominator tree
 DEBUG cranelift_codegen::timing::details      > timing: Starting Verify CPU flags, (during Verify Cranelift IR)
 DEBUG cranelift_codegen::timing::details      > timing: Ending Verify CPU flags
 DEBUG cranelift_codegen::timing::details      > timing: Ending Verify Cranelift IR
 DEBUG cranelift_codegen::timing::details      > timing: Starting Verify live ranges, (during Register allocation)
 DEBUG cranelift_codegen::timing::details      > timing: Ending Verify live ranges
 DEBUG cranelift_codegen::timing::details      > timing: Starting Verify CSSA, (during Register allocation)
 DEBUG cranelift_codegen::timing::details      > timing: Ending Verify CSSA
 DEBUG cranelift_codegen::timing::details      > timing: Starting RA spilling, (during Register allocation)
 DEBUG cranelift_codegen::regalloc::spilling   > Spilling for:
function %return_4_i64s(i64 sret [%rdi]) -> i64 [sret(0x0)], i64 [sret(0x8)], i64 [sret(0x10)], i64 [sret(0x18)], i64 sret [%rax] fast {
                                ebb0(v4: i64):
[RexOp1pu_id#b8,sret(0)]            v0 = iconst.i64 0
[RexOp1pu_id#b8,sret(8)]            v1 = iconst.i64 1
[RexOp1pu_id#b8,sret(16)]           v2 = iconst.i64 2
[RexOp1pu_id#b8,sret(24)]           v3 = iconst.i64 3
[RexOp1st#8089]                     store notrap aligned v0, v4
[RexOp1stDisp8#8089]                store notrap aligned v1, v4+8
[RexOp1stDisp8#8089]                store notrap aligned v2, v4+16
[RexOp1stDisp8#8089]                store notrap aligned v3, v4+24
[Op1ret#c3]                         return v0, v1, v2, v3, v4
}

 DEBUG cranelift_codegen::regalloc::spilling   > Spilling ebb0:
 DEBUG cranelift_codegen::regalloc::spilling   > Inst v0 = iconst.i64 0, Pressure[ 1+0/14 0+0/16 ]
 DEBUG cranelift_codegen::regalloc::spilling   > Inst v1 = iconst.i64 1, Pressure[ 2+0/14 0+0/16 ]
 DEBUG cranelift_codegen::regalloc::spilling   > Inst v2 = iconst.i64 2, Pressure[ 3+0/14 0+0/16 ]
 DEBUG cranelift_codegen::regalloc::spilling   > Inst v3 = iconst.i64 3, Pressure[ 4+0/14 0+0/16 ]
 DEBUG cranelift_codegen::regalloc::spilling   > Inst store.i64 notrap aligned v0, v4, Pressure[ 5+0/14 0+0/16 ]
 DEBUG cranelift_codegen::regalloc::spilling   > Inst store.i64 notrap aligned v1, v4+8, Pressure[ 5+0/14 0+0/16 ]
 DEBUG cranelift_codegen::regalloc::spilling   > Inst store.i64 notrap aligned v2, v4+16, Pressure[ 5+0/14 0+0/16 ]
 DEBUG cranelift_codegen::regalloc::spilling   > Inst store.i64 notrap aligned v3, v4+24, Pressure[ 5+0/14 0+0/16 ]
 DEBUG cranelift_codegen::regalloc::spilling   > Inst return v0, v1, v2, v3, v4, Pressure[ 5+0/14 0+0/16 ]
 DEBUG cranelift_codegen::regalloc::spilling   >   reguse: v4@op4/fixed
 DEBUG cranelift_codegen::timing::details      > timing: Ending RA spilling
 DEBUG cranelift_codegen::timing::details      > timing: Starting Verify Cranelift IR, (during Register allocation)
 DEBUG cranelift_codegen::timing::details      > timing: Starting Control flow graph, (during Verify Cranelift IR)
 DEBUG cranelift_codegen::timing::details      > timing: Ending Control flow graph
 DEBUG cranelift_codegen::timing::details      > timing: Starting Dominator tree, (during Verify Cranelift IR)
 DEBUG cranelift_codegen::timing::details      > timing: Ending Dominator tree
 DEBUG cranelift_codegen::timing::details      > timing: Starting Verify CPU flags, (during Verify Cranelift IR)
 DEBUG cranelift_codegen::timing::details      > timing: Ending Verify CPU flags
 DEBUG cranelift_codegen::timing::details      > timing: Ending Verify Cranelift IR
 DEBUG cranelift_codegen::timing::details      > timing: Starting Verify live ranges, (during Register allocation)
 DEBUG cranelift_codegen::timing::details      > timing: Ending Verify live ranges
 DEBUG cranelift_codegen::timing::details      > timing: Starting Verify CSSA, (during Register allocation)
 DEBUG cranelift_codegen::timing::details      > timing: Ending Verify CSSA
 DEBUG cranelift_codegen::timing::details      > timing: Starting RA reloading, (during Register allocation)
 DEBUG cranelift_codegen::regalloc::reload     > Reload for:
function %return_4_i64s(i64 sret [%rdi]) -> i64 [sret(0x0)], i64 [sret(0x8)], i64 [sret(0x10)], i64 [sret(0x18)], i64 sret [%rax] fast {
                                ebb0(v4: i64):
[RexOp1pu_id#b8,sret(0)]            v0 = iconst.i64 0
[RexOp1pu_id#b8,sret(8)]            v1 = iconst.i64 1
[RexOp1pu_id#b8,sret(16)]           v2 = iconst.i64 2
[RexOp1pu_id#b8,sret(24)]           v3 = iconst.i64 3
[RexOp1st#8089]                     store notrap aligned v0, v4
[RexOp1stDisp8#8089]                store notrap aligned v1, v4+8
[RexOp1stDisp8#8089]                store notrap aligned v2, v4+16
[RexOp1stDisp8#8089]                store notrap aligned v3, v4+24
[Op1ret#c3]                         return v0, v1, v2, v3, v4
}

 DEBUG cranelift_codegen::regalloc::reload     > Reloading ebb0:
 DEBUG cranelift_codegen::timing::details      > timing: Ending RA reloading
 DEBUG cranelift_codegen::timing::details      > timing: Starting Verify Cranelift IR, (during Register allocation)
 DEBUG cranelift_codegen::timing::details      > timing: Starting Control flow graph, (during Verify Cranelift IR)
 DEBUG cranelift_codegen::timing::details      > timing: Ending Control flow graph
 DEBUG cranelift_codegen::timing::details      > timing: Starting Dominator tree, (during Verify Cranelift IR)
 DEBUG cranelift_codegen::timing::details      > timing: Ending Dominator tree
 DEBUG cranelift_codegen::timing::details      > timing: Starting Verify CPU flags, (during Verify Cranelift IR)
 DEBUG cranelift_codegen::timing::details      > timing: Ending Verify CPU flags
 DEBUG cranelift_codegen::timing::details      > timing: Ending Verify Cranelift IR
 DEBUG cranelift_codegen::timing::details      > timing: Starting Verify live ranges, (during Register allocation)
 DEBUG cranelift_codegen::timing::details      > timing: Ending Verify live ranges
 DEBUG cranelift_codegen::timing::details      > timing: Starting Verify CSSA, (during Register allocation)
 DEBUG cranelift_codegen::timing::details      > timing: Ending Verify CSSA
 DEBUG cranelift_codegen::timing::details      > timing: Starting RA coloring, (during Register allocation)
 DEBUG cranelift_codegen::regalloc::coloring   > Coloring for:
function %return_4_i64s(i64 sret [%rdi]) -> i64 [sret(0x0)], i64 [sret(0x8)], i64 [sret(0x10)], i64 [sret(0x18)], i64 sret [%rax] fast {
                                ebb0(v4: i64):
[RexOp1pu_id#b8,sret(0)]            v0 = iconst.i64 0
[RexOp1pu_id#b8,sret(8)]            v1 = iconst.i64 1
[RexOp1pu_id#b8,sret(16)]           v2 = iconst.i64 2
[RexOp1pu_id#b8,sret(24)]           v3 = iconst.i64 3
[RexOp1st#8089]                     store notrap aligned v0, v4
[RexOp1stDisp8#8089]                store notrap aligned v1, v4+8
[RexOp1stDisp8#8089]                store notrap aligned v2, v4+16
[RexOp1stDisp8#8089]                store notrap aligned v3, v4+24
[Op1ret#c3]                         return v0, v1, v2, v3, v4
}

 DEBUG cranelift_codegen::regalloc::coloring   > Coloring ebb0:
 DEBUG cranelift_codegen::regalloc::coloring   > Start ebb0 with entry-diversion set to
      { }
 DEBUG cranelift_codegen::regalloc::coloring   > Coloring v0 = iconst.i64 0
    from [ GPR: acdb--s-89012345 FPR: 0123456789012345 FLAG: f ]
 DEBUG cranelift_codegen::regalloc::coloring   >     glob [ GPR: acdb--sd89012345 FPR: 0123456789012345 FLAG: f ]
 DEBUG cranelift_codegen::regalloc::coloring   >     color v0 -> %rax
 DEBUG cranelift_codegen::regalloc::coloring   > Coloring v1 = iconst.i64 1
    from [ GPR: -cdb--s-89012345 FPR: 0123456789012345 FLAG: f ]
 DEBUG cranelift_codegen::regalloc::coloring   >     glob [ GPR: acdb--sd89012345 FPR: 0123456789012345 FLAG: f ]
 DEBUG cranelift_codegen::regalloc::coloring   >     color v1 -> %rcx
 DEBUG cranelift_codegen::regalloc::coloring   > Coloring v2 = iconst.i64 2
    from [ GPR: --db--s-89012345 FPR: 0123456789012345 FLAG: f ]
 DEBUG cranelift_codegen::regalloc::coloring   >     glob [ GPR: acdb--sd89012345 FPR: 0123456789012345 FLAG: f ]
 DEBUG cranelift_codegen::regalloc::coloring   >     color v2 -> %rdx
 DEBUG cranelift_codegen::regalloc::coloring   > Coloring v3 = iconst.i64 3
    from [ GPR: ---b--s-89012345 FPR: 0123456789012345 FLAG: f ]
 DEBUG cranelift_codegen::regalloc::coloring   >     glob [ GPR: acdb--sd89012345 FPR: 0123456789012345 FLAG: f ]
 DEBUG cranelift_codegen::regalloc::coloring   >     color v3 -> %rbx
 DEBUG cranelift_codegen::regalloc::coloring   > Coloring store.i64 notrap aligned v0, v4
    from [ GPR: ------s-89012345 FPR: 0123456789012345 FLAG: f ]
 DEBUG cranelift_codegen::regalloc::coloring   >     glob [ GPR: acdb--sd89012345 FPR: 0123456789012345 FLAG: f ]
 DEBUG cranelift_codegen::regalloc::coloring   > Coloring store.i64 notrap aligned v1, v4+8
    from [ GPR: ------s-89012345 FPR: 0123456789012345 FLAG: f ]
 DEBUG cranelift_codegen::regalloc::coloring   >     glob [ GPR: acdb--sd89012345 FPR: 0123456789012345 FLAG: f ]
 DEBUG cranelift_codegen::regalloc::coloring   > Coloring store.i64 notrap aligned v2, v4+16
    from [ GPR: ------s-89012345 FPR: 0123456789012345 FLAG: f ]
 DEBUG cranelift_codegen::regalloc::coloring   >     glob [ GPR: acdb--sd89012345 FPR: 0123456789012345 FLAG: f ]
 DEBUG cranelift_codegen::regalloc::coloring   > Coloring store.i64 notrap aligned v3, v4+24
    from [ GPR: ------s-89012345 FPR: 0123456789012345 FLAG: f ]
 DEBUG cranelift_codegen::regalloc::coloring   >     glob [ GPR: acdb--sd89012345 FPR: 0123456789012345 FLAG: f ]
 DEBUG cranelift_codegen::regalloc::coloring   > Coloring return v0, v1, v2, v3, v4
    from [ GPR: ------s-89012345 FPR: 0123456789012345 FLAG: f ]
 DEBUG cranelift_codegen::regalloc::solver     > reassign_in(v4:GPR, %rdi -> %rax)
 DEBUG cranelift_codegen::regalloc::solver     > add_var(v0:GPR, from=%rax)
 DEBUG cranelift_codegen::regalloc::solver     > -> new var: v0(GPR, from %rax, in, out)
 DEBUG cranelift_codegen::regalloc::coloring   >     kill v4 in %rdi (local GPR)
 DEBUG cranelift_codegen::regalloc::coloring   >     kill v0 in %rax (local GPR)
 DEBUG cranelift_codegen::regalloc::coloring   >     kill v1 in %rcx (local GPR)
 DEBUG cranelift_codegen::regalloc::coloring   >     kill v2 in %rdx (local GPR)
 DEBUG cranelift_codegen::regalloc::coloring   >     kill v3 in %rbx (local GPR)
 DEBUG cranelift_codegen::regalloc::coloring   >     glob [ GPR: acdb--sd89012345 FPR: 0123456789012345 FLAG: f ]
 DEBUG cranelift_codegen::regalloc::solver     > collect_moves: [v0:GPR(%rax -> %rsi), v4:GPR(%rdi -> %rax)]
 DEBUG cranelift_codegen::regalloc::solver     > move #0: v0:GPR(%rax -> %rsi)
 DEBUG cranelift_codegen::regalloc::solver     > move #1: v4:GPR(%rdi -> %rax)
 DEBUG cranelift_codegen::timing::details      > timing: Ending RA coloring
 DEBUG cranelift_codegen::timing::details      > timing: Starting Verify Cranelift IR, (during Register allocation)
 DEBUG cranelift_codegen::timing::details      > timing: Starting Control flow graph, (during Verify Cranelift IR)
 DEBUG cranelift_codegen::timing::details      > timing: Ending Control flow graph
 DEBUG cranelift_codegen::timing::details      > timing: Starting Dominator tree, (during Verify Cranelift IR)
 DEBUG cranelift_codegen::timing::details      > timing: Ending Dominator tree
 DEBUG cranelift_codegen::timing::details      > timing: Starting Verify CPU flags, (during Verify Cranelift IR)
 DEBUG cranelift_codegen::timing::details      > timing: Ending Verify CPU flags
 DEBUG cranelift_codegen::timing::details      > timing: Ending Verify Cranelift IR
 DEBUG cranelift_codegen::timing::details      > timing: Starting Verify live ranges, (during Register allocation)
 DEBUG cranelift_codegen::timing::details      > timing: Ending Verify live ranges
 DEBUG cranelift_codegen::timing::details      > timing: Starting Verify value locations, (during Register allocation)
 DEBUG cranelift_codegen::timing::details      > timing: Ending Verify value locations
 DEBUG cranelift_codegen::timing::details      > timing: Ending Register allocation
 DEBUG cranelift_codegen::timing::details      > timing: Ending Compilation passes
function %return_4_i64s(i64 sret [%rdi]) -> i64 [sret(0x0)], i64 [sret(0x8)], i64 [sret(0x10)], i64 [sret(0x18)], i64 sret [%rax] fast {
                                ebb0(v4: i64 [%rdi]):
[RexOp1pu_id#b8,%rax]               v0 = iconst.i64 0
[RexOp1pu_id#b8,%rcx]               v1 = iconst.i64 1
[RexOp1pu_id#b8,%rdx]               v2 = iconst.i64 2
[RexOp1pu_id#b8,%rbx]               v3 = iconst.i64 3
[RexOp1st#8089]                     store notrap aligned v0, v4
[RexOp1stDisp8#8089]                store notrap aligned v1, v4+8
[RexOp1stDisp8#8089]                store notrap aligned v2, v4+16
[RexOp1stDisp8#8089]                store notrap aligned v3, v4+24
[RexOp1rmov#8089]                   regmove v0, %rax -> %rsi
[RexOp1rmov#8089]                   regmove v4, %rdi -> %rax
[Op1ret#c3]                         return v0, v1, v2, v3, v4
;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
; error: inst4: ABI expects v0 at return pointer offset 0, got %rsi

}

; 1 verifier error detected (see above). Compilation aborted.

One idea I was considering (but don't actually know if this is a good idea or just a kludgy workaround) was introducing a sret_store (or store_copy) instruction that would return a new copy of the value that is located at the stored location (similar to copy and spill).

Basically, this would turn the previous example into:

function %return_4_i64s(i64 sret [%rdi]) -> i64 [sret(0x0)], i64 [sret(0x8)], i64 [sret(0x10)], i64 [sret(0x18)], i64 sret [%rax] fast {
ebb0(v4: i64):
    v0 = iconst.i64 0
    v1 = iconst.i64 1
    v2 = iconst.i64 2
    v3 = iconst.i64 3
    v5 = sret_store v4+0, v0
    v6 = sret_store v4+8, v1
    v7 = sret_store v4+16, v2
    v8 = sret_store v4+24, v3
    return v5, v6, v7, v8, v4
}

But I didn't really want to pursue this further without getting feedback from y'all.


Thanks!

@bjorn3
Contributor

bjorn3 commented Oct 16, 2019

When a value is stored into a stack slot, it is normally done using spill, not store. The former says that the value has moved, while the latter stores a copy.
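
In clif terms, the difference looks something like this (illustrative sketch only, not output from an actual compilation):

```
    ss0 = spill_slot 8
    ...
    v5 = spill v0                 ; v0's home moves to the stack slot; no register copy remains
    store notrap aligned v1, v4   ; v1 keeps its register; memory just gets a copy
```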

@bnjbvr
Member

bnjbvr commented Oct 16, 2019

These two lines in the log:

 DEBUG cranelift_codegen::regalloc::solver     > reassign_in(v4:GPR, %rdi -> %rax)
 DEBUG cranelift_codegen::regalloc::solver     > add_var(v0:GPR, from=%rax)

mean that the solver (during the coloring phase of regalloc) is making room so that v4 can be in rax, which is the ABI return register (when only one value is returned); then v0 (which already lives in rax) must be diverted, so it must be added as a solver variable so rax is available for v4.

Is it possible instead to change the ABI location requirements of return values in cranelift-codegen/src/isa/x86/abi.rs when a function returns multiple values? Maybe even create the stack slot for the returned data structure's base there.

@fitzgen
Member Author

fitzgen commented Oct 16, 2019

When a value is stored into a stack slot, it is normally done using spill, not store.

Yeah, the tricky thing that is different from spill is that the stack slot isn't in the callee's frame, the caller allocates the space for the sret in its frame and passes a pointer into the callee.

@fitzgen
Member Author

fitzgen commented Oct 16, 2019

then v0 (which already lives in rax) must be diverted, so it must be added as a solver variable so rax is available for v4.

But the thing is v0 shouldn't be considered to live in %rax, since it got stored into its sret slot and I updated the ValueLocation to reflect that. So it should be fine to overwrite %rax rather than try and divert its current value.

Basically, the problem is that I don't know why (a) updating the ValueLocation isn't enough to convince the register allocator of that, and (b) how to fix this (do I need to do more than update the ValueLocation? does the register allocator need to be fixed?).

Is it possible instead to change the ABI location requirements of return values in cranelift-codegen/src/isa/x86/abi.rs when a function returns multiple values? Maybe even create the stack slot for the returned data structure's base there.

I can look into it / thought I was doing that by setting the values' ArgumentLoc to the new RetPtrOffset variant.

Regarding specifically creating the stack slot when looking at return values, I don't think that can work since the stack slot is in the caller's frame, not the callee's frame.

@bjorn3
Contributor

bjorn3 commented Oct 16, 2019

Basically, the problem is that I don't know why (a) updating the ValueLocation isn't enough to convince the register allocator of that, and (b) how to fix this (do I need to do more than update the ValueLocation? does the register allocator need to be fixed?).

[RexOp1pu_id#b8,sret(0)] v0 = iconst.i64 0 is impossible. There is no encoding of iconst which stores its result at a certain memory location. It only supports storing in a register.

One idea I was considering (but don't actually know if this is a good idea or just a kludgy workaround) was introducing a sret_store (or store_copy) instruction that would return a new copy of the value that is located at the stored location (similar to copy and spill).

I think this is the proper solution, as you would need a separate machine instruction to store the result anyway in many cases and every machine instruction corresponds to a clif instruction. (While the conditional trap instructions map to multiple machine instructions, they are an exceptional case.)

// Make room for the maximum amount of incoming return pointer space we'll
// need.
//
// TODO FITZGEN: alignment?
Collaborator

is this a comment re. alignment requirements of this return region on the stack, or if this affects alignment of the computed layout? (if the latter, the layout is currently always made to leave the stack aligned)

Member Author

This is regarding the alignment required by the return values. For example, if for some reason there is an i32 stack slot, and then we need another stack slot for 8-byte-aligned data, we would need to add padding after the i32 slot (or rearrange them) and I'm just not sure yet how much of this infra already exists and is done for me automatically or not, and how to express these requirements if it does exist.
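
The rounding involved in the i32-then-8-byte-slot scenario can be sketched like this — an assumed helper for illustration, not an existing Cranelift API:

```rust
// Round `offset` up to `align` (a power of two), as a stack-slot layout
// must do before placing 8-byte-aligned data after, e.g., an i32 slot.
fn align_to(offset: u32, align: u32) -> u32 {
    debug_assert!(align.is_power_of_two());
    (offset + align - 1) & !(align - 1)
}

fn main() {
    // An i32 slot ends at offset 4; the next 8-byte-aligned slot must
    // start at 8, leaving 4 bytes of padding.
    assert_eq!(align_to(4, 8), 8);
    // An already-aligned offset is unchanged.
    assert_eq!(align_to(8, 8), 8);
}
```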

@@ -56,6 +58,14 @@ pub fn layout_stack(frame: &mut StackSlots, alignment: StackSize) -> CodegenResu
.ok_or(CodegenError::ImplLimitExceeded)?;
outgoing_max = max(outgoing_max, offset);
}
StackSlotKind::RetPtr => {
Collaborator

This feels like a kind of silly question: where are StackSlots with kind RetPtr added? I'd expect iterating over all calls in a function and adding the necessary slots from that, but I don't see anything of the sort.

edit: I think I see the relevant TODO in legalizer/boundary.rs :)

Collaborator

As a followup (and I think there ought to be a test case, or several, around this): if we're reusing the largest RetPtr slot for different multi-return calls with spills, I'd like to be sure that we know this stack slot is clobbered on reuse. That way Cranelift can spill out of this slot into somewhere that return values can be safely read from. E.g. v0 should be moved out of the stack-returns area in this clif, which I think is pretty close to valid:

function %retptr_reuse() fast {
    sig0 = () -> i32, i32, i32, i32, i32, i32 fast
    sig1 = () -> f32, f32, f32, f32, f32, f32 fast

    ireturns = u0:0 sig0
    freturns = u0:1 sig1

ebb0:
    v0, v1, v2, v3, v4, v5 = call ireturns()
    v6, v7, v8, v9, v10, v11 = call freturns()
    ; I'm concerned v6 will overwrite v0, as things are currently
}

I don't know if the notion of stack slots being reused for different purposes is even something Cranelift knows about right now?

Member Author

I haven't really started on implementing the caller side yet; sorry, I should have been clearer in the OP.

Yeah I think it is worth not attempting to reuse these stack slots at all in the first pass implementation, and then circling back to add the reuse as a new PR.

- let (sig_ref, args) = match dfg[inst].analyze_call(&dfg.value_lists) {
-     CallInfo::Direct(func, args) => (dfg.ext_funcs[func].signature, args),
+ let (sig_ref, args) = match func.dfg[inst].analyze_call(&func.dfg.value_lists) {
+     CallInfo::Direct(f, args) => (func.dfg.ext_funcs[f].signature, args),
Collaborator

Small naming comment: how about target or dest instead of f?

ret_ptr_param.location = loc;
sig.params.insert(0, ret_ptr_param);
}
ArgAction::Convert(_) => unreachable!(),
Collaborator

Is it correct to phrase this as unreachable!("`pointer_type` shouldn't require conversion")?

@bnjbvr
Member

bnjbvr commented Oct 17, 2019

Any idea what might be going on and what I should do from here?

Ok, I re-read it and read your code too. v0 is not dead, since it's referenced in the return instruction. Does it work when you only keep the last value (i.e. the ret struct pointer) in the IR return instruction when legalizing the signature? There's apparently no need to keep the other values live in the ret instruction, since they're manually moved to their final location before the actual ret.

@fitzgen
Member Author

fitzgen commented Oct 17, 2019

Thanks for the feedback and tips @bjorn3, @iximeow, and @bnjbvr!

Does it work when you only keep the last value (i.e. the ret struct pointer) in the IR return instruction when legalizing the signature? There's apparently no need to keep the other values live in the ret instruction, since they're manually moved to their final location before the actual ret.

I think this actually makes more sense than the sret_store instruction, since we need loads on the caller side as well, and if we have return values that don't really "exist" until loaded that seems funky given my new/better understanding of what's going on here.

I'll try legalizing this signature

function %return_4_i64s() -> i64, i64, i64, i64 fast

into

function %return_4_i64s(v0: i64 sret) -> i64 sret fast

and legalizing returns into stores, as before, except with only the single return retptr value

;; before
ebb0:
    v0 = iconst.i64 0
    v1 = iconst.i64 1
    v2 = iconst.i64 2
    v3 = iconst.i64 3
    return v0, v1, v2, v3

;; after
ebb0(v4: i64):
    v0 = iconst.i64 0
    v1 = iconst.i64 1
    v2 = iconst.i64 2
    v3 = iconst.i64 3
    store notrap aligned v0, v4
    store notrap aligned v1, v4+8
    store notrap aligned v2, v4+16
    store notrap aligned v3, v4+24
    return v4
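The offsets in those stores (0, 8, 16, 24) just come from laying the return values out contiguously at their natural alignments. A minimal sketch of that layout computation (a hypothetical helper for illustration, not Cranelift's actual API):

```rust
// Hypothetical helper: assign each return value an offset in the sret
// area, aligning each value to its natural alignment.
fn assign_sret_offsets(sizes: &[u32]) -> (Vec<u32>, u32) {
    let mut offset = 0u32;
    let mut offsets = Vec::with_capacity(sizes.len());
    for &size in sizes {
        // Natural alignment == size for power-of-two-sized scalar types,
        // which is all this sketch assumes.
        let align = size.next_power_of_two();
        offset = (offset + align - 1) & !(align - 1);
        offsets.push(offset);
        offset += size;
    }
    // Per-value offsets, plus the total size needed for the sret area.
    (offsets, offset)
}

fn main() {
    // Four i64 (8-byte) returns land at offsets 0, 8, 16, 24 in a
    // 32-byte area, matching the stores above.
    println!("{:?}", assign_sret_offsets(&[8, 8, 8, 8]));
}
```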

and finally legalizing calls like this

;; before
ebb0:
    v0, v1, v2, v3 = call %return_4_i64s

;; after
    ss0 = ret_ptr 32
ebb0:
    v4 = stack_addr.i64 ss0
    v5 = call %return_4_i64s(v4)
    v0 = stack_load.i64 ss0
    v1 = stack_load.i64 ss0+8
    v2 = stack_load.i64 ss0+16
    v3 = stack_load.i64 ss0+24
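In other words, the convention this legalized code implements is roughly the following, written as ordinary Rust (a hand-written analogy, not code Cranelift emits):

```rust
// Callee side: write each result through the caller-provided pointer,
// then hand the same pointer back (the sret pointer is also returned in
// a register, so the caller doesn't have to keep it live).
fn return_4_i64s(sret: &mut [i64; 4]) -> *const i64 {
    sret[0] = 0;
    sret[1] = 1;
    sret[2] = 2;
    sret[3] = 3;
    sret.as_ptr()
}

// Caller side: allocate the return area in our own frame (playing the
// role of the ret_ptr stack slot ss0), pass its address, then load the
// values back out afterwards.
fn caller() -> (i64, i64, i64, i64) {
    let mut ss0 = [0i64; 4];
    return_4_i64s(&mut ss0);
    (ss0[0], ss0[1], ss0[2], ss0[3])
}

fn main() {
    println!("{:?}", caller());
}
```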

This feels very promising and I think I can actually remove a lot of the extra ArgumentLoc and ValueLoc stuff that I added.

Looking back, I don't know why I was so determined to keep the multiple return values in the signature, instead of letting them dissolve away into the loads and stores like they actually do at the machine level...

@bjorn3
Contributor

bjorn3 commented Oct 23, 2019

Looking back, I don't know why I was so determined to keep the multiple return values in the signature, instead of letting them dissolve away into the loads and stores like they actually do at the machine level...

It could help a consumer generate debuginfo for return values from the post-compilation IR.

@fitzgen fitzgen force-pushed the many-multi-value-returns branch 6 times, most recently from ef4bf0a to 19b884f Compare October 28, 2019 23:07
@fitzgen fitzgen marked this pull request as ready for review October 28, 2019 23:08
@fitzgen fitzgen changed the title [WIP] Many multi-value returns Many multi-value returns Oct 28, 2019
@fitzgen fitzgen requested a review from bnjbvr October 28, 2019 23:09
@fitzgen
Member Author

fitzgen commented Oct 28, 2019

Ok! I have all tests passing plus a bunch of new tests specifically for multi-value stuff :)

I think this is ready for review! r? @bnjbvr

Member

@bnjbvr bnjbvr left a comment

Looks like this is on the right path, great work.

  • Re: the commit message of Support arbitrary numbers of return values, could you remove the examples please? The test cases provide enough context for what's going on imo.
  • Can you add tests for call_indirect, please?
  • This one is trickier, because I'm not sure we have enough runtime support in the run test framework for this (in particular with respect to linking), but: it would be extra nice to have run tests for functions returning multiple values and checking that the returned values are correct.
  • Have you considered de-grouping return values, that is, instead of having a single RetPtr stack slot, have several OutgoingRet stack slots? I wonder if this wouldn't make a few things simpler, like, no need to precompute the size of the entire ret stack space, so the two big loops when legalizing a call instruction could be fused.
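For the call_indirect case, a minimal test shape could be something like this (a sketch in the syntax used elsewhere in this thread, not insisting on the exact form):

```
function %multi_ret_indirect(i64) -> i64, i64 fast {
    sig0 = () -> i64, i64 fast
ebb0(v0: i64):
    v1, v2 = call_indirect sig0, v0()
    return v1, v2
}
```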

Files with review comments:
cranelift-codegen/meta/src/isa/x86/encodings.rs
cranelift-codegen/src/ir/extfunc.rs
cranelift-codegen/src/ir/stackslot.rs
cranelift-codegen/src/isa/x86/abi.rs
cranelift-codegen/src/legalizer/boundary.rs
@fitzgen
Member Author

fitzgen commented Oct 30, 2019

@bnjbvr

This one is trickier, because I'm not sure we have enough runtime support in the run test framework for this (in particular with respect to linking), but: it would be extra nice to have run tests for functions returning multiple values and checking that the returned values are correct.

Yeah, the test run mode doesn't support calls, since it compiles each function individually. We have calls working in wasmtime and tests that they work right, so would you be OK with skipping test run tests here?

@fitzgen
Member Author

fitzgen commented Oct 30, 2019

Have you considered de-grouping return values, that is, instead of having a single RetPtr stack slot, have several OutgoingRet stack slots? I wonder if this wouldn't make a few things simpler, like, no need to precompute the size of the entire ret stack space, so the two big loops when legalizing a call instruction could be fused.

In order to pass in a single sret pointer, we need to ensure that the space for each of these returns is contiguous and that each one is properly aligned. I think upholding those properties (and making it clear that they are upheld) would only be harder if we exploded the ret_ptr stack slot into multiple pieces.

Instead, let the existing frame layout algorithm decide where they should go.
@fitzgen fitzgen merged commit 7a4957c into bytecodealliance:master Nov 5, 2019
@fitzgen fitzgen deleted the many-multi-value-returns branch November 5, 2019 22:36